Kinesis
- Kinesis Data Streams
- Low latency streaming ingest at scale
- Kinesis Data Analytics
- Perform real-time analytics on streams using SQL
- Kinesis Data Firehose
- Load streams into S3, Redshift, Elasticsearch, and Splunk
Kinesis Data Streams
- Streams are divided into ordered Shards
- Ability to reprocess / replay data
- Multiple applications can consume the same stream
- Real-time processing with scalable throughput
- Once data is inserted in Kinesis, it can't be deleted (data in Kinesis is immutable)
Key Concepts
Shards
- The number of shards can evolve over time (shard splitting / merging)
- Records are only ordered per shard
- Billing is per shard provisioned
Records
- Data Blob
- Data being sent, serialized as bytes
- Up to 1 MB
- Record Key
- Helps to group records in Shards (same key goes to the same shard)
- Should use a highly distributed key to avoid the hot partition problem
- Sequence number
- Unique ID for records put in shards
- Added by Kinesis after ingestion
Limits
- Producer
- 1MB/s or 1000 records/s at write per shard
- ProvisionedThroughputException otherwise
- Consumer Classic
- 2MB/s at read per shard across all consumers
- 5 API calls/s per shard across all consumers
- Consumer Enhanced Fan-Out
- 2MB/s at read per shard per enhanced consumer
- No API calls needed
- Data Retention
- 24 hours by default
- Up to 7 days
Kinesis Producers
- Kinesis Producer SDK
- Managed AWS sources
- Kinesis Producer Library (KPL)
- Kinesis Agent
- 3rd party libraries
Kinesis Producer SDK
- PutRecord
- PutRecords
- Uses batching to increase throughput
Solutions for ProvisionedThroughputExceeded Exceptions
- Retries with backoff
- Increase the number of shards (scale up)
- Use highly distributed partition key
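As an illustration, a minimal boto3 sketch (stream name, key choice, and payloads are assumptions) that combines PutRecords batching with backoff-based retries:

```python
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_records_with_backoff(stream_name, records, max_retries=5):
    """Send (partition_key, bytes) pairs in one batch, retrying throttled writes."""
    entries = [
        # The partition key determines the target shard; a highly
        # distributed key (e.g. a device ID) avoids hot shards
        {"Data": data, "PartitionKey": key}
        for key, data in records
    ]
    for attempt in range(max_retries):
        try:
            response = kinesis.put_records(StreamName=stream_name, Records=entries)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(0.1 * 2 ** attempt)  # exponential backoff
            continue
        if response["FailedRecordCount"] == 0:
            return
        # PutRecords is not atomic: keep only the entries that failed
        entries = [e for e, r in zip(entries, response["Records"]) if "ErrorCode" in r]
        time.sleep(0.1 * 2 ** attempt)
    raise RuntimeError("records still throttled after retries")
```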
Managed AWS Sources
- CloudWatch Logs
- AWS IoT
- Kinesis Data Analytics
Kinesis Producer Library (KPL)
- C++ / Java library
- For high-performance, long-running producers
- Automated and configurable retry mechanism
- Sync or Async (better performance) API
- Submits metrics to CloudWatch
- Batching (default) - increases throughput, decreases cost
- Collects records and writes to multiple shards in the same PutRecords API call
- Aggregation - increases latency
- Stores multiple records in one record (to go over the 1,000 records/s limit)
- Increases payload size to improve throughput (maximize the 1 MB/s limit)
- Configured with RecordMaxBufferedTime (100 ms by default)
- Limitations
- Does not support compression (must be implemented by the user)
- KPL records can only be decoded with the KCL or a special helper library
Kinesis Agent
- Monitors log files and sends them to Kinesis Data Streams
- Only supports Linux-based servers
- Can read from multiple directories and write to multiple streams
- Can route based on directory or log file
- Can pre-process data before sending to streams (single line, json, etc.)
- Handles file rotation, checkpointing and retry
- Can emit metrics to CloudWatch
Kinesis Consumers Classic
- Kinesis Consumer SDK
- Kinesis Client Library (KCL)
- Kinesis Connector Library
- 3rd party libraries (Spark, etc.)
- Managed AWS sources
- Kinesis Firehose
- AWS Lambda
Kinesis Consumer SDK
- GetRecords
- Returns up to 10MB of data (then throttle for 5s) or up to 10,000 records
- Maximum of 2MB/s at read per shard
- Maximum of 5 GetRecords API calls/s per shard (200ms latency)
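A minimal polling-consumer sketch with boto3 (stream name is an assumption); note the sleep that keeps the loop under the 5 calls/s per-shard limit:

```python
import time
import boto3

kinesis = boto3.client("kinesis")
stream = "my-stream"  # assumed stream name

# Read from the first shard; TRIM_HORIZON starts at the oldest available record
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

while iterator:
    # Each call returns up to 10 MB / 10,000 records
    response = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in response["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = response.get("NextShardIterator")
    time.sleep(0.2)  # stay within 5 GetRecords calls/s per shard
```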
Kinesis Client Library (KCL)
- Java library (other languages are supported via the MultiLangDaemon)
- Read records from Kinesis produced with the KPL (de-aggregation)
- Shard discovery: shards are distributed among multiple consumers in one group
- Checkpointing: resume progress
- Leverages DynamoDB for coordination and checkpointing
- Use provisioned DynamoDB with enough WCU and RCU
- Use On-Demand DynamoDB
- Otherwise DynamoDB may slow down KCL
Kinesis Connector Library
- Deprecated
- Can only run on EC2
- Write data to S3, DynamoDB, Redshift, Elasticsearch
- Can be replaced by Kinesis Firehose or Lambda
AWS Lambda Sourcing from Kinesis
- Lambda consumers have a library to de-aggregate records from the KPL
- Lambda can be used to run lightweight ETL to anywhere you want
- Lambda can be used to trigger notifications or send emails in real time
- Lambda has a configurable batch size to regulate throughput
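A minimal sketch of a Kinesis-triggered Lambda handler (assumes producers send plain JSON; KPL-aggregated records would additionally need the de-aggregation helper library):

```python
import base64
import json

def handler(event, context):
    """Invoked by a Kinesis event source mapping with a batch of records."""
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded inside the Lambda event
        payload = base64.b64decode(record["kinesis"]["data"])
        document = json.loads(payload)  # assumption: JSON payloads
        # lightweight ETL / notifications / alerts go here
        print(record["kinesis"]["partitionKey"], document)
```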
Kinesis Enhanced Fan-Out
- Kinesis pushes data to consumers over HTTP/2
- Each consumer gets 2 MB/s of provisioned throughput per shard
- No shared 2 MB/s limit: adding consumers no longer splits the read throughput
- Reduced latency (~70 ms)
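A rough boto3 sketch of enhanced fan-out (stream ARN, consumer name, and shard ID are placeholders; in practice you would wait for the consumer to become ACTIVE before subscribing):

```python
import boto3

kinesis = boto3.client("kinesis")

# Register a named consumer on the stream (placeholder ARN and name)
consumer = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
    ConsumerName="my-enhanced-consumer",
)["Consumer"]

# Kinesis pushes records over HTTP/2; each subscription lasts up to 5 minutes
subscription = kinesis.subscribe_to_shard(
    ConsumerARN=consumer["ConsumerARN"],
    ShardId="shardId-000000000000",
    StartingPosition={"Type": "LATEST"},
)
for event in subscription["EventStream"]:
    for record in event["SubscribeToShardEvent"]["Records"]:
        print(record["SequenceNumber"], record["Data"])
```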
Enhanced Fan-Out vs Standard Consumers
| Standard Consumers | Enhanced Fan-Out |
| --- | --- |
| Low number of consuming applications | Multiple consumer applications for the same stream |
| Can tolerate ~200 ms latency | Low latency requirements (~70 ms) |
| Minimize cost | Higher cost (soft limit of 5 enhanced fan-out consumers per data stream) |
Kinesis Scaling
Shard Splitting
- Can be used to increase the Stream capacity (1MB/s per shard)
- Can be used to divide a hot shard
- The old shard is closed and deleted based on data expiration
Shard Merging
- Decrease the Stream capacity and save costs
- Group two shards with low traffic
- Old shards are closed and deleted based on data expiration
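For illustration, a boto3 sketch of both operations (stream name and shard IDs are placeholders); a split supplies the hash key where the shard is cut, a merge names two adjacent shards:

```python
import boto3

kinesis = boto3.client("kinesis")

# Split a hot shard at the midpoint of its hash key range
shard = kinesis.describe_stream(StreamName="my-stream")["StreamDescription"]["Shards"][0]
low = int(shard["HashKeyRange"]["StartingHashKey"])
high = int(shard["HashKeyRange"]["EndingHashKey"])
kinesis.split_shard(
    StreamName="my-stream",
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((low + high) // 2),
)

# Merge two adjacent cold shards back into one
kinesis.merge_shards(
    StreamName="my-stream",
    ShardToMerge="shardId-000000000001",
    AdjacentShardToMerge="shardId-000000000002",
)
```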
No Native Auto Scaling
- Can implement Auto Scaling with AWS Lambda
- The API call to change the number of shards is UpdateShardCount
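A one-call sketch of what such a Lambda would run (stream name and target count are assumptions):

```python
import boto3

kinesis = boto3.client("kinesis")

# e.g. triggered by a CloudWatch alarm on the stream's throughput metrics
kinesis.update_shard_count(
    StreamName="my-stream",         # assumed name
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",  # the only supported scaling type
)
```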
Limitations
- Resharding cannot be done in parallel (have to plan capacity in advance)
- Can only perform one resharding at a time and it takes a few seconds
- Doubling from 1,000 to 2,000 shards takes about 8.3 hours
Kinesis Security
- Control access / authorization using IAM policies
- Encryption in transit using HTTPS endpoints
- Encryption at rest using KMS
- Client-side encryption must be implemented manually
- VPC Endpoints are available for Kinesis
Kinesis Data Firehose
- Near Real Time (60s latency minimum for non-full batches)
- Read data from
- SDK or KPL
- Kinesis Agent
- Kinesis Data Streams
- CloudWatch Logs & Events
- IoT rules actions
- Load data into
- S3
- Redshift
- Elasticsearch
- Splunk
- Spark or KCL cannot read from Firehose
- Auto scaling
- Support Data Conversions
- JSON to Parquet
- JSON to ORC (only for S3)
- Support Data Transformation through AWS Lambda
- Support compression (only for S3)
- GZIP, ZIP, Snappy
- Only GZIP when data is further loaded into Redshift
- Only pay for the amount of data going through Firehose
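A minimal boto3 sketch of a direct put into Firehose (delivery stream name and payload are assumptions); the trailing newline is a common convention so records land line-delimited in S3:

```python
import boto3

firehose = boto3.client("firehose")

# Records are buffered and flushed when the buffer size or time is reached
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",  # assumed name
    Record={"Data": b'{"sensor": "s1", "value": 42}\n'},
)
```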
Delivery Features
- Source records, transformation failures, and delivery failures can all be stored directly in S3
Buffer Sizing
- When the buffer size or buffer time limit is reached, the buffer is flushed
- Firehose can automatically increase the buffer size to raise throughput
- High throughput: the buffer size limit is hit first
- Low throughput: the buffer time limit is hit first
Kinesis Data Streams vs Firehose
| Streams | Firehose |
| --- | --- |
| Custom consumer code | Fully managed; only sends to S3, Redshift, Elasticsearch, Splunk |
| Real time (~200 ms classic, ~70 ms enhanced fan-out) | Near real time (~1 min buffer time) |
| Manual scaling (shard splitting / merging) | Auto scaling |
| Data storage for 1 to 7 days | No data storage |
AWS SQS
SQS - Standard Queue
- Auto Scales without limit
- Retention of messages from 1 minute to 14 days, 4 days by default
- No limit on the number of messages in the queue
- Low latency (~10 ms on publish and receive)
- Consumers can be scaled horizontally
- Can have duplicate messages
- Can have out of order messages
- Limitation of 256KB per message
SQS - FIFO Queue
- Name of the queue ends in .fifo
- Up to 3,000 messages/s with batching or 300 messages/s without (soft limit)
- Exactly-once delivery (duplicates are removed)
- Messages are processed in order
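A boto3 sketch of a FIFO send (queue URL and IDs are assumptions); ordering is scoped to the message group, and the deduplication ID gives exactly-once within a 5-minute window:

```python
import boto3

sqs = boto3.client("sqs")

sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo",  # assumed
    MessageBody='{"order_id": 1}',
    MessageGroupId="customer-42",      # ordering is per message group
    MessageDeduplicationId="order-1",  # deduplicated within 5 minutes
)
```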
Producing Messages
- Message body
- Message attributes (metadata)
- Delivery delay
- Return
- Message ID
- MD5 hash of the body
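For a standard queue, a minimal boto3 sketch (queue URL and attribute are assumptions) showing the parameters and return values above:

```python
import boto3

sqs = boto3.client("sqs")

response = sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/my-queue",  # assumed
    MessageBody="hello",
    DelaySeconds=5,  # delivery delay
    MessageAttributes={"source": {"DataType": "String", "StringValue": "sensor-1"}},
)
print(response["MessageId"], response["MD5OfMessageBody"])
```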
Consuming Messages
- Pull messages from SQS
- Up to 10 messages at a time
- Process the message within the visibility timeout
- Delete the message by message ID & receipt handle
- Maximum of 120,000 in-flight messages (received but not yet deleted)
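A minimal consumer loop sketch (queue URL and the process function are hypothetical):

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # assumed

response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,  # up to 10 messages per call
    WaitTimeSeconds=20,      # long polling
    VisibilityTimeout=30,    # processing must finish within this window
)
for message in response.get("Messages", []):
    process(message["Body"])  # hypothetical processing function
    # Deleting requires the receipt handle from this specific receive call
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```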
SQS Extended Client
- Send messages larger than 256KB
- Not recommended
- Java Library
- Send message to S3 first
- Only send metadata to SQS
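The official extended client is Java-only, but the pattern itself is simple; a rough Python sketch (bucket, queue URL, and pointer format are assumptions, not the library's wire format):

```python
import json
import uuid
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

def send_large_message(queue_url, bucket, body):
    """Store the payload in S3 and send only a small pointer through SQS."""
    key = str(uuid.uuid4())
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode())
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"s3_bucket": bucket, "s3_key": key}),
    )

def receive_large_messages(queue_url):
    """Resolve each S3 pointer back into the full payload."""
    for message in sqs.receive_message(QueueUrl=queue_url).get("Messages", []):
        pointer = json.loads(message["Body"])
        obj = s3.get_object(Bucket=pointer["s3_bucket"], Key=pointer["s3_key"])
        yield obj["Body"].read().decode(), message["ReceiptHandle"]
```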
Use Cases
- Decouple applications asynchronously
- Buffer writes
- Handle large loads of messages
Pricing
- Pay per API request
- Pay network usage
Security
- HTTPS endpoints
- Server Side Encryption using KMS
- IAM
- SQS queue access policy
AWS IoT
Device Gateway
- The entry point for IoT devices connecting to AWS IoT
- Supports the MQTT, WebSockets, and HTTP 1.1 protocols
Message Broker
- Lets devices communicate with one another
- Pub/sub messaging pattern with low latency
- Supports the MQTT, WebSockets, and HTTP 1.1 protocols
Rules Engine
- Rules are defined on the Broker topics
- Rule-action pairs
- Used to augment or filter data received from devices
- Rules need IAM Roles to perform the actions
Thing Registry
- Represent all connected devices
- A unique ID
- Metadata
- X.509 certificate
- IoT Groups
- Can organize the resources associated with each device
Authentication
- Things
- X.509 certificates
- AWS SigV4
- Custom tokens with custom authorizers
- Mobile apps
- Cognito identities
- Web / Desktop / CLI
- IAM or federated identities
Authentication Policies
- IoT policies
- Attached to X.509 certificates or Cognito Identities
- Can be attached to Groups instead of Things
- IAM policies
- Used for controlling IoT AWS APIs
Device Shadow
- Records the state of Things and synchronizes the state once they come back online
- JSON document
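A small boto3 sketch of writing a desired state to a shadow (thing name and state fields are assumptions):

```python
import json
import boto3

# The "iot-data" client talks to the device shadow service
iot_data = boto3.client("iot-data")

iot_data.update_thing_shadow(
    thingName="my-thermostat",  # assumed thing name
    payload=json.dumps({
        "state": {
            # The app writes "desired"; the device writes "reported"
            # and AWS IoT computes the delta between the two
            "desired": {"temperature": 21}
        }
    }).encode(),
)
```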
Greengrass
- Brings the compute layer to the devices locally
- Execute AWS Lambda on the devices and operate offline
- Deploy functions from the cloud
Database Migration Service (DMS)
- Migrate databases to AWS
- The source database remains available during the migration
- Continuous Data Replication using Change Data Capture (CDC) requires running an EC2 replication instance
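For illustration, a boto3 sketch of creating a full-load-plus-CDC task (all ARNs are placeholders; the endpoints and replication instance must already exist):

```python
import json
import boto3

dms = boto3.client("dms")

dms.create_replication_task(
    ReplicationTaskIdentifier="orders-migration",           # assumed name
    SourceEndpointArn="arn:aws:dms:...:endpoint/source",    # placeholder
    TargetEndpointArn="arn:aws:dms:...:endpoint/target",    # placeholder
    ReplicationInstanceArn="arn:aws:dms:...:rep/instance",  # placeholder
    MigrationType="full-load-and-cdc",  # full load first, then ongoing CDC
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection", "rule-id": "1", "rule-name": "all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```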
Sources and Targets
- Sources
- On-Premise or EC2 instances databases
- RDS
- S3
- Targets
- On-Premise or EC2 instances databases
- RDS, Redshift, DynamoDB
- S3
- Elasticsearch, Kinesis Data Streams, DocumentDB
Schema Conversion Tool (SCT)
- A tool to convert a database schema from one engine to another
- Can be used to create DMS endpoints and tasks
Differences between DMS and SCT

| DMS | SCT |
| --- | --- |
| Migrate smaller relational workloads (< 10 TB) and MongoDB | Migrate large data warehouse workloads |
| Supports ongoing replication to keep the target in sync | Does not support ongoing replication |
Direct Connect
- Provides a dedicated private connection from a remote network to your public resources or VPCs
- Need a Virtual Private Gateway on your VPC
- Can setup multiple 1 Gbps or 10 Gbps connections
- Connect to Direct Connect endpoint in Direct Connect location
- Use Cases
- Increase bandwidth throughput
- Real-time data feeds
- Hybrid Environment
- Enhanced security using private connection
- Use two Direct Connect as a failover
- Use Direct Connect Gateway if you want to connect to VPCs in different regions
Snowball
- Physical device to move TBs to PBs of data in or out of AWS
- Secured using KMS 256-bit encryption
- Use Snowball if it takes more than a week to transfer over the network
Snowball Edge
- More capacity (100 TB) with computational capability
- Storage optimized
- Compute optimized
- Can pre-process the data while moving
- Support EC2 AMI
- Support Lambda functions
Snowmobile
- 100 PB of capacity
- Better than Snowball if transferring more than 10 PB